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Abstract 

The ratio of divergence at nonsynonymous and synonymous sites, dN/dS, is a widely used measure in evolutionary genetic 
studies to investigate the extent to which selection modulates gene sequence evolution. Originally tailored to codon 
sequences of distantly related lineages, dN/dS represents the ratio of fixed nonsynonymous to synonymous differences. 
The impact of ancestral and lineage-specific polymorphisms on d/V/dS, which we here show to be substantial for closely 
related lineages, is generally neglected in estimation techniques of dN/dS. To address this issue, we formulate a codon 
model that is firmly anchored in population genetic theory, derive analytical expressions for the d/V/dS measure by 
Poisson random field approximation in a Markovian framework and validate the derivations by simulations. In good 
agreement, simulations and analytical derivations demonstrate that dN/dS is biased by polymorphisms at short time 
scales and that it can take substantial time for the expected value to settle at its time limit where only fixed differences 
are considered. We further show that in any attempt to estimate the d/V/dS ratio from empirical data the effect of the 
intrinsic fluctuations of a ratio of stochastic variables, can even under neutrality yield extreme values of d/V/dS at short 
time scales or in regions of low mutation rate. Taken together, our results have significant implications for the inter- 
pretation of d/V/dS estimates, the McDonald- Kreitman test and other related statistics, in particular for closely related 
lineages. 

Key words: population genetics of d/V/dS, codon evolution, genomic signatures of natural selection, Poisson random field 
approximation. 



Introduction 

The extent to which selection promotes evolutionary change 
has long been a key question in the evolutionary sciences. 
Although at the phenotypic level the importance of selection 
is widely recognized, its role in modulating evolution at the 
molecular level remains debated (Nei et al. 2010). One pop- 
ular indicator of selection acting on protein-coding DNA 
sequences is the dN/dS ratio. Because of its alleged simplicity 
and intuitive appeal, this measure has a strong tradition in 
evolutionary research, notably for the identification of genes 
with a history of positive selection (Nielsen 2005). In short, the 
dN/dS ratio quantifies the mode and strength of selection by 
comparing synonymous substitution rates (dS) — assumed to 
be neutral — with nonsynonymous substitution rates (dN), 
which are exposed to selection as they change the amino 
acid composition of a protein. Unity of the ratio is generally 
taken to indicate neutrality, values exceeding unity are 
interpreted as selection promoting change (positive 
selection), and values less than one are usually taken as an 
indication for selection suppressing protein change (purifying 
selection). 

Originally the dN/dS ratio was developed in a phyloge- 
netics context, and its estimation was based on codon 
sequences of distantly related lineages, where it is reasonable 
to assume that dN/dS represents the ratio of fixed 



nonsynonymous to synonymous differences between lineages 
(Miyata et al. 1980; Li et al. 1985; Nei and Gojobori 1986; 
Goldman and Yang 1994; Muse and Gaut 1994). The dN/dS 
ratio can then be approximated as a deterministic function 
of population size N and the selection coefficient s (Nielsen 
and Yang 2003; Kryazhimskiy and Plotkin 2008). However, 
recent empirical (Wolf et al. 2009) and theoretical work 
(Kryazhimskiy and Plotkin 2008; Peterson and Masel 2009) 
has challenged whether dN/dS appropriately reflects the out- 
come of selection across all relevant evolutionary time scales. 
As soon as we leave the realm of phylogenetics where single 
stereotypic genomes are compared and enter the realm of 
population genetics, the dN/dS ratio is no longer based on 
only fixed nonsynonymous versus synonymous differences. 
Segregating polymorphisms can substantially alter estimates 
of divergence and consequently estimates of dN/dS (Peterson 
and Masel 2009; Charlesworth 2010). As a consequence, 
both recently arisen lineage-specific variants as well as 
shared ancestral polymorphisms need to be taken into 
account. Kryazhimskiy and Plotkin (2008) theoretically inves- 
tigated the two most extreme cases in timescale considering 
1) the pure phylogenetics context, where the dN/dS ratio is 
based on fixed differences between distantly related lineages 
and 2) the pure population genetics context, where the 
dN/dS ratio is based on segregating polymorphisms within 
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one panmictic population of conspecific individuals. In a phy- 
logenetics context, codon evolution is modeled as a Markov 
process, which indirectly assumes that fixation of a mutation 
in the population occurs instantaneously (Goldman and Yang 
1994; Muse and Gaut 1994). In a population genetics context, 
simulations of sequence evolution in the presence of selection 
are often based on Wright-Fisher sampling (Wright 1931). 
Under the assumptions that 1) codons evolve independently 
of each other under free recombination, and 2) polymorphic 
codon positions are not allowed to mutate further, allowing 
for a maximum biallelic state, the Markov model of codon 
evolution can be viewed as a time limit of a Wright-Fisher 
population process. That is, the jump rates of the Markov 
model of codon evolution can be interpreted in terms of the 
fixation probability of a Wright-Fisher population process 
with selection (Nielsen and Yang 2003). However, for closely 
related lineages the assumption that divergence time is large 
enough to view the Markov model of codon evolution as a 
time limit of a Wright-Fisher population process will be 
violated. This violation gives rise to a gap between pure 
population genetics and phylogenetic modeling approaches, 
neither of which can adequately capture the evolutionary 
relevant, temporal dynamics of the dN/dS ratio. 

To fill this gap, we anchor the dN/dS ratio firmly in 
population genetics theory and develop a codon substitution 
model that allows us to describe the continuous dynamics 
of dN/dS across evolutionary time starting from a single 
panmictic population followed by a speciation event eventu- 
ally resulting in deep phylogenetic divergence (fig. 1). 
We derive analytical expressions for the dN/dS measure in a 
Poisson Random Field framework integrating the relative 



contributions of ancestral polymorphisms, lineage-specific 
polymorphisms, and fixed differences through time. Our anal- 
ysis shows under which evolutionary conditions, namely 
population size, selection coefficient, and mutation rate, poly- 
morphisms influence the expectation for dN/dS at any given 
time point after speciation. The comprehensive mathematical 
description of dN/dS based on a ratio of stochastic variables 
further allows to estimate the associated variation and reveals 
that for recently diverged lineages stochastic forces acting on 
dN/dS are not negligible. In that, we provide a null model 
making apparent the inherent biases in the estimation of 
dN/dS generating false positive inference in the study of adap- 
tive evolution. The results do not merely affect estimation of 
the dN/dS ratio itself, but are of likewise importance for 
related statistics such as the McDonald-Kreitman test 
(McDonald and Kreitman 1991) or the a-estimate (Smith 
and Eyre-Walker 2002). We finally advocate that a combina- 
tion of divergence and polymorphism data be used to esti- 
mate true dN/dS ratios and associated confidence intervals for 
closely related lineages, something that appears to be feasible 
in light of the current progress in sequencing technology. 

Results 

Review of the Classical Definition of dN/dS 
Over long time scales, selection is generally inferred from 
evolutionary change between divergent lineages that arose 
after a distant population split or speciation event. Each 
lineage is then represented by one stereotypic genome 
sequence, where sequence comparison allows quantifying 
evolutionary change at orthologous positions. Here, 
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Fig. 1. Scheme of the evolutionary model — Speciation occurs instantaneously at time 0 and the two populations evolve separately and do not 
interbreed until present time t. Mutations are depicted by black dots in the right part of the graph, where the arrows from right to left point to the 
population in which the mutation happened. The right part of the graph shows the path to absorption (fixation or extinction) of these mutations. The 
blue lines show paths of mutations that occurred before speciation, where paths evolve separately after speciation (ancestral polymorphism). At time 0, 
these mutations constitute shared polymorphisms. The red lines show paths to absorption of lineage-specific mutations that occurred and got 
absorbed in one of the two populations between [0, t] (fixed differences). The green lines show paths of lineage-specific mutations, which at present 
time t are still segregating in the respective population (lineage-specific polymorphism). 
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protein-coding sequences offer the great possibility that they 
allow to contrast nonsynonymous and synonymous changes, 
which yields intuitive insight into the mode and strength of 
selection. The problem of estimating the strength of selection 
then essentially becomes a problem of estimating substitu- 
tion rates for two classes of changes. This has been addressed 
by two sets of methods, heuristic counting methods (Miyata 
et al. 1980; Nei and Gojobori 1986) and maximum likelihood 
based approaches (Goldman and Yang 1994; Muse and Gaut 
1994), the latter modeling the substitution process as a 
continuous-time Markov process with 61 possible states 
corresponding to the 61 sense codons. Under an infinite 
sites model and under the assumptions of free recombination 
and instantaneous fixation of novel mutations, dN/dS can be 
recaptured based on Kimura's expression for the probability 
of fixation of newly arising variants of frequency 1 /N (Sawyer 
and Hartl 1992). To apply Kimura's expression for the prob- 
ability of fixation under a finite sites model, we have to make 
the additional assumption that polymorphic codon positions 
are not allowed to mutate further, such that there are never 
more than two alleles segregating at the same codon position 
(Nielsen and Yang 2003). It is then considered that synony- 
mous mutations evolve in the population under neutral 
reproduction. Nonsynonymous mutations are influenced 
by selective forces in such a way that the fitness of the 
ancestral to the derived alleles are 1 to 1 + s, where all nonsyn- 
onymous mutations have the same selection coefficient. 
Each time a derived allele becomes fixed fitness is reassigned 
such that the derived allele is considered as the new ancestral 
state and fitness is set to 1. The fitness of any potential 
new mutation (including back mutation) is set to 1 + s. 
Under this model, selection acts as a mechanism which 
promotes (s > 0) or prevents (s < 0) changes of codons 
involving amino acid replacements, and does not represent 
a preference for or against specific codon types (for discussion 
see Nielsen and Yang [2003]). If the chance of an immediate 
back mutation is small enough to be neglected, then for 
nonsynonymous mutations the fixation probability is given 
by (Kimura 1962), 

1 - e~ 2s _ 1 - e- 2 Y' N _ 1 2/ 
qy ~ 1 - e- 2Ns ~ 1 - e- 2 r % N 1 - e~ 2 v ' (1) 

where N is the (effective) population size, s is the selection 
coefficient, and y = Ns is the population-scaled selection 
coefficient. By taking y — > 0, we recover the neutral case of 
a synonymous mutation, for which the probability that a 
novel mutation gets fixed in the population is q 0 = 1/N. 
Note that derivations of the probability of fixation are 
based on a haploid population of size N. Under the assump- 
tion of additive fitness effects in a diploid organism, these 
derivations are equivalent to a diploid population of size 
N/2. The expected numbers of nonsynonymous and synon- 
ymous substitutions per generation scale with q Y and q 0 , and 
hence dN/dS is interpreted as the ratio of these given by 

e>y = — ^ 1 _ e _ 2y , 7^0, w 0 = 1. (2) 



Here, co Y corresponds to the a> typically estimated from data 
using software packages such as PAML (Yang 2007). One 
objection to bear in mind is that fixation effects were assumed 
to be instantaneous, where sequence divergence is equal to 
the number of mutations that occurred and got fixed 
between two populations after population split. In practice, 
however, the sequence divergence of two populations is mea- 
sured based on the number of differences observed at diver- 
gence time t between two sequences each sampled from one 
of the two distinct populations. Thus, in addition to muta- 
tions that occurred and got fixed after population split also 
shared ancestral and newly arisen lineage-specific polymor- 
phisms will contribute to the total divergence (fig. 1). 

Definition of dN/dS in a Population Genetics- 
Phylogenetics Framework 

We will now drop the assumption of instantaneous fixation 
and formulate an explicit codon substitution model that 
allows to describe the expectation of nonsynonymous and 
synonymous divergence at any point in time. Following the 
standard approach, dN/dS is derived from amino acid 
sequence divergence between two divergent lineages (or 
populations). We make the simplifying assumption that 
speciation follows an isolation-without-migration model as 
illustrated in figure 1. We thus consider two independent 
populations both of size N where each element is a sequence 
of L codons or 3L nucleotide sites. Let t denote the popula- 
tion-scaled evolutionary divergence time between the two 
populations at present time, where Nt generations have 
passed since population divergence time at t-0. Mutation 
events occur at a rate /x > 0 per individual (nucleotide) site 
and generation. Whenever a nucleotide is hit by mutation, a 
target nucleotide is chosen according to a 4 x 4 Markov 
chain transition probability matrix H. Note that H can be 
viewed as any commonly used nucleotide substitution 
model, such as the Jukes-Cantor model. The fate in the pop- 
ulation of this newly introduced derived type nucleotide over 
subsequent generations is extinction or fixation, determined 
by a standard Wright-Fisher reproduction mechanism, which 
furthermore distinguishes between nonsynonymous and 
synonymous changes. We assume that codons evolve inde- 
pendently (free recombination) and that [i is sufficiently 
small to allow for a scenario where each new mutation will 
only affect monomorphic codon sites. Hence, in this model, 
codon sites will be at most biallelic. 

In each polymorphic site, the pair of ancestral and derived 
codon will be either synonymous or nonsynonymous. 
Mutations leading to synonymous changes evolve in the pop- 
ulation under neutral reproduction, whereas mutations lead- 
ing to nonsynonymous changes are influenced by selection 
such that the fitness of the ancestral to the derived alleles are 
1 to 1 + s. Following practice in much of the population ge- 
netics literature including theoretical studies of dN/dS 
(Sawyer and Hartl 1992), we consider the diffusion approxi- 
mation scaling regime of large N and small s, where the pop- 
ulation-scaled selection coefficient y = Ns reflects the total 
(signed) selection pressure per site and generation. Similarly, 
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the constant 6 = 3fiLN measures total mutation pressure 
per codon sequence and generation. In the Materials and 
Methods, we show that based on the fundamental parame- 
ters H, y, and 9 together with the knowledge of the genetic 
code, it is possible in the framework of our model to derive 
the proportions of mutation events leading to synonymous 
and nonsynonymous codon pairs. Hence, we can write 

0 = $ S y n + $ n0 n + ^stop i 

to distinguish synonymous and nonsynonymous changes, as 
well as sorting out an intensity 8 stop for events that lead to 
stop codons. 

Now, to obtain dN/dS, we consider the sequence diver- 
gence between two sequences sampled from two distinct 
populations or lineages. Sequence divergence can be split 
into the two contributions of nonsynonymous and synony- 
mous divergence and their ratio can be used to quantify the 
impact of selection acting on the entire coding sequence. To 
this end, we let D non (t) denote nonsynonymous divergence 
and D syn (t) synonymous divergence at time t, and write 
D(t) = D non (t) + D syn (t) for the total divergence. Sequence 
divergence should naturally be proportional to sequence 
length and increase over time essentially with the same rate 
as that of substitutions occurring in either population from 
the time of population split and onward. This is indeed a 
property of our model, in which D non (t) and D syn (t) are in- 
dependent and have approximate Poisson distributions with 
expected values 

ED non (t) = 29 non d non (t) and ED syn (t) = 20 syn d syn (t), 

where d non and cf yn are functions of y and t. The factor 2 
arises from the fact that we consider divergence between two 
populations. 

As a first measure of dN/dS we take the ratio of expected 
values 

ED^ (t)/0non= ^ 

' ED syn (t)/0 syn d syn (t) 

where the normalization by 8 non and 6* syn accounts for the 
difference in mutation pressure for nonsynonymous and syn- 
onymous changes, respectively. The ratio in equation (3) rep- 
resents a measure of the average rates of nonsynonymous to 
synonymous divergence. Similar to previous work (Rocha 
et al. 2006; Kryazhimskiy and Plotkin 2008; Peterson and 
Masel 2009), our aim is to understand the relation between 
dN/dS and natural selection as a function of evolutionary 
time. Our contribution here is to provide more detailed 
expressions than reported earlier for d non (t) and d syn (t) 
across all relevant evolutionary timescales with special atten- 
tion to small t. At the same time, we are cautious about the 
use of equation (3) as a single dN/dS measure. After all, upon 
accepting the underlying model assumption that divergence 
is the result of random sampling from random populations of 
random sequences, it is restrictive in the end to only compare 
two expected values. To initiate a discussion of alternative 
measures of dN/dS, perhaps more suitable to reflect the 
various fluctuations involved, we will compare in the next 



subsection the ratio of the expected values ED non (t) and 
ED syn (t) in equation (3) with the expected value of the 
ratio of the independent random variables D non (t) and 
D syn (t), see equation (7) later. 

To incorporate the contribution of polymorphism and 
thus expand the classical definition of dN/dS we consider 
three levels of mathematical modeling which are described 
in detail in the Materials and Methods. In short, we begin 
with a full codon substitution model (the phylogenetics 
component) embedded in a population genetics framework 
represented by a discrete Markov chain. Because of the 
complexity of the model, analytical insight is limited. In a 
second step, we therefore resort to an analytically more 
tractable continuous time approximation. This allows us to 
find the codon equilibrium distribution and to extract the 
typical rates of synonymous and nonsynonymous mutations. 
For standard mutation models, the latter can be derived 
explicitly. Finally, in a third step, we argue that key properties 
including the rate of divergence over time are captured well 
by approximate Poisson distributions. This approach is rem- 
iniscent of the Poisson's random fields model, which has been 
used for similar purposes earlier (Sawyer and Hartl 1992). The 
main assumptions for the model parameters are that N and L 
are both large while the ratio 20log(N)/L, which represents 
the fraction of polymorphic sites in the sequence, is kept 
sufficiently small. From this, we derive detailed results for 
the dN/dS ratio, in particular with regards to dependence 
on the selection parameter y and divergence time t 

As a consequence of the Poisson approximation, we can 
treat nonsynonymous as well as synonymous divergence 
as the sum of three independent Poisson distributed 
components, arising from divergence due to fixation of new 
mutations since population divergence, lineage-specific poly- 
morphisms, and shared ancestral polymorphisms. This basi- 
cally corresponds to sampling two sequences, one sequence 
from each population, aligning them and counting the 
number of synonymous and nonsynonymous differences. 
We then distinguish three cases how these differences 
could have arrived. First, we distinguish mutations which oc- 
curred before or after population split. Second, for mutations 
that occurred after population split we make the further dis- 
tinction whether the mutant is already fixed in its population 
or still segregating. The first case of mutations, which occurred 
before population split, are referred to as ancestral divergence 
(blue lines in fig. 1). Fixed differences due to mutations that 
occurred after population split are referred to as fixed diver- 
gence (red lines in fig. 1). Finally, mutations, which occurred 
after population split and are still segregating, are referred to 
as polymorphic divergence (green lines in fig. 1). Accordingly, 
the mean divergence splits into three types of contributions, 
and we can write 

Ht) = 0)+«Ci(t)+C(t). 

(4) 

where d^° n (t), d^"(t) represent divergence due to fixation of 
new mutations since population divergence, d"°"(t), cCJ(t) 
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are contributions from sampling of lineage-specific polymor- 
phic sites, and d™"(t),di* n c (t) take into account the effect 
of ancestral polymorphisms which existed at t = 0. In the 
Materials and Methods, we present in detail approximation 
formulas for all of these, using three auxiliary functions which 
we denote by C y (t), H y (t), and ] y (t), Briefly, t — C y (t) scales 
with the average number of fixations up to time t and H Y (t) 
scales with the average number of lineage-specific polymor- 
phisms that get sampled for its derived allele at time t, in both 
cases referring to mutations that occurred after population 
split. The function J Y (t) scales with the average number of 
sampled differences at t, which originate from mutations in 
the ancestral population. 

With time-explicit derivations of all terms in equation (4) 
established, we may sum up the expected divergence from 
each contribution, be it ancestral, polymorphic, or fixed, and 
compare nonsynonymous and synonymous terms, writing 

j ii , rf" on (0 , , t-C y (t) + H y (t)+J y (t) 
dN/dS | total = ^ = co Y t _ Co(t)+min(t>2)+1 ■ 

(5) 

The asymptotic limits are given by a> y as t — > oo and 
by (a> y - 1)/y as t -> 0. 

To help interpret the various contributions in equation (5), 
we proceed to look at the separate dN/dS ratios for each type, 
which we denote by dN/dS | fixation , dN/dS | p 0 | ymorphic , and 
dN/dS | anC estrai- Tn e ratio d^° n (t)/d^"(t) is insensitive to t 
and quickly converges to a> y with increasing t. Hence, in 
agreement with equation (3), the contribution to dN/dS 
due to fixations of lineage-specific mutations is 



dN/dS 



fixation 



COyjt ~ Cy(t)) 

t - C 0 (t) 



2y 



The slight deviation of d™ n (t)/d^ n (t) from co y is essentially 
due to our definition of fixed differences, which are based on 
lineage-specific mutations only. In fact, fixed differences could 
arise due to lineage-specific mutations as well as due to 
shared ancestral polymorphisms. However, once the sum of 
the various distributions to divergence estimates is computed 
(as in dN/dS\ tota |), both kinds of fixed differences are 
considered. For divergence attributed to lineage-specific 
polymorphisms, we find that d p °"(t)/d p y ^(t), which equals 1 
at t-0, quickly approaches a limiting value p yi such that 



dN/ds\ pdymorphlc = d;™(t)/d^(t) 



COyHy(t) 

min(t,2) 



-2y 



1 + 2y + 2y 2 



2y(1 - e- J y) 



t — >■ oo. 



Finally, the ancestral contribution as time evolves has a 
limiting ratio <p y such that 

dN/dS | ancestral = dZ(t)/dZ(t) 

= COyJy(t) <fty « COyO - | - ^) % 1 + - ^ . 



At t-0, the ancestral ratio takes into account the effect of 
selection when we sample two individuals from the single 
population, which forms the common ancestry at population 
split. In equation (26) of the Materials and Methods, we 
show that J y (Q) = {co y — X)lyay y , so that we have the initial 
ratio 



O°)/O0) 



My 



1 



1 + 



Y 



Y 



YL 

45 



(6) 



The time dependence of the various ratios is illustrated in the 
top panel of figure 2 for y = — 1. The three separate ratios 
dN/dS | fixation , dN/dS | po | yrnorphic , dN/dS | ancestra | are plotted 
together with the ratio of total expectations dN/dS | tota[ , 
as well as the limiting value co Y . In the bottom panel of 
figure 2, the limiting long time ratios co y , p y , and (p y as 
well as the initial ancestral contribution (co y — 1)/y are plot- 
ted together as functions of y. 

The interesting observation is that when combining the 
effects of all three contributions by forming dN/dS | tota! , the 
ancestral and polymorphic divergence influence the total di- 
vergence ratio over a substantial time period before settling 
down at co y . With an increasing number of fixation events 
and hence actual lineage-specific nucleotide substitutions 
building up differences between the two populations over 
a considerable time span, it is of course the rate of linear 
increase in d£° n (t) in comparison with that of d^"(t), which 
will ultimately decide the asymptotic dN/dS ratio. But as 
evident in figure 2, ancestral and lineage-specific polymor- 
phisms which also generate differences between the observed 
sequences seek out their own preferred balance of nonsyn- 
onymous to synonymous change. As long as ancestral and 
polymorphic differences measure up on the scale of fixations, 
the limiting numbers p y and 4> y influence the total ratio. 
The ancestral initial value, which is manifestly different from 
a) y , ensures that the transition to fixation asymptotics is 
clearly visible on the evolutionary time scale. In summary, 
this clearly indicates that, indeed, dN/dS is naturally a function 
of time. For the case y < 0 of negative selection, dN/dS 
decreases from its initial ratio (a) y — < 1 to co y < 1. 
If y > 0, then dN/dS increases from its initial ratio 
(co y — 1)/y > 1 to o)y > 1. Figure 3 illustrates the time 
dependence of dN/ dS \ tota | for five values of y. 

We conclude this section with additional remarks on the 
relation of our results to previous results in the literature. The 
work of Kryazhimskiy and Plotkin (2008) is concerned with 
the relationship between selection and dN/dS values mea- 
sured from two sequences sampled from a single population. 
In this situation, differences between the sequences reflect 
segregating polymorphisms and not fixed differences. The 
authors offer a theoretical foundation of the dN/dS ratio for 
single population data and demonstrate that the frequent use 
of equation (2) in the context of intraspecific sequence data 
lacks proper justification and is inappropriate. The desired 
dN/dS ratio addressed in (Kryazhimskiy and Plotkin 2008) is 
closely related to dN/dS | ancestral at t = 0 in our model, that is, 
the dN/dS ratio based on two sequences sampled from a 
single (ancestral) population existing prior to speciation. 
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dN/dS|fix ' 



dN/dS|pol 



dN/dS|anc 



dN/dS|total 0) y 



s/j 
13 




1 ' 1 

10 20 
Divergence time f [N generations] 



30 



3 I 



1 - 




Selection coefficient y 

Fig. 2. (Top) dN/dS ratios for fixation (red), polymorphic (green), ancestral (blue), and total (gold line) divergence for y = — 1 as a function of 
divergence time t; (bottom) the limiting values co y for dN/dS | fixation (red), p y for dN/dS | polymorphic (g reen )/ <Py f° r dN/dS | ancestral ( so ''d blue), and the 
initial ancestral contribution (coy — 1)/y (dashed blue) as a function of y. 



In fact, the initial ratio found to be (w Y — X)ly in equation 
(6) is a stationary d/V/dS ratio for sequences sampled in a 
single population. However, the corresponding quantity 
(y,6) obtained in (Kryazhimskiy and Plotkin 2008), equa- 



OJ 



pop 



tion (5), depends not only on y but also on 9. This fact can be 
traced back to the derivation of Lt>(y,f9) pop , which is based on 
a Wright-Fisher approximation that allows for back and forth 



mutations during the segregating phase, further assuming 
that 6 is sufficiently small. It is then natural to interpret the 
8 — > 0 limit of a)(y,6) pop as a generic d/V/dS ratio within 
populations, and straightforward to check that Lu(y,0) pop = 
(o)y — X)/y in complete agreement with equation (6). 

The work by Sawyer and Hard (1992) form the theoretical 
basis for the McDonald-Kreitman test (McDonald and 
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Kreitman 1991). These authors provide sampling formulas for 
fixed nonsynonymous and synonymous differences and for 
nonsynonymous and synonymous polymorphisms in a 
model setting, which is rather close to the model advocated 
in the present work. Regarding mutations, our rates # syn , f? non , 
which are derived from the codon substitution model, can be 
considered equivalent to the mutation rate parameters /x s 
and /j. r used by Sawyer and Hartl. Turning to the expected 
number of fixed differences Sawyer and Hartl assume linearity 
in time, whereas we get refined expressions for the number of 
fixed differences as we distinguish between fixed differences 
originated from shared ancestral polymorphisms and fixed 
differences originated from lineage-specific mutations. 
Our expressions are d%"(t) + cff n n c (t) = t - C 0 (t) + 1 and 
< n (t) + d a n n ° c n (t) = w y {t - C y (t) +J v (i)) compared with t 
and Wyt in (Sawyer and Hartl 1992), equation (13). 
However, the assumption of linearity in time is well justified 
for distantly related lineages and critical only for closely 
related lineages, where ancestral fixed differences measure 
up on the scale of lineage-specific fixed differences. Turning 
to the sampling formulas for nonsynonymous and synony- 
mous polymorphisms Sawyer and Hartl consider arbitrary 
samples of size n and m from two species. In our settings, 
m = n = 1 as typical in phylogenetic approaches. Their 
results (Sawyer and Hartl 1992), equations (17) and (18), 
with m = 1 correspond to 
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where we derive the time dependent functions min(t,2) 
and (OyHyit), see equations (21) and (22). The differences 
between our results and the results by Sawyer and Hartl 
arise from the fact that Sawyer and Hartl assume that the 
number of lineage-specific segregating sites has reached its 
equilibrium. Although this assumption is well justified for 
distantly related lineages, it is clearly violated for more closely 
related lineages. Our results better capture the reality of 



divergence between evolutionary young lineages, and 
converge to the results by Sawyer and Hartl for t — > oo. 

Statistical Properties of dN/dS 

The previous section has treated the time dependence of the 
dN/dS ratio, measured as the ratio of expected values of two 
independent Poisson random variables. However, when the 
ratio of nonsynonymous to synonymous divergence is esti- 
mated from sequence data, we in fact do not know their 
expected values but rather carry out single observations of 
D non (t) and D syn (t) and consider the ratio of these. Regardless 
of the estimation procedure, for example, counting methods 
or maximum likelihood approaches, an estimation based on 
sequence data will always just reflect a single observation or 
measurement. This is important to notice, as the ratio of 
expected values is in general not equal to the expected 
value of a ratio. We are therefore interested in the statistical 
properties of the expected value of a ratio of two Poisson 
random variables rather than in a ratio of expected values. 
Proper statistical inference therefore must take into account 
the natural fluctuations of such a ratio of random variables. 
This leads us to studying the ratio of nonsynonymous to 
synonymous divergence in a population genetics framework 
as the scaled ratio of Poisson variables 



D n ° n (t)/0n 



on D syn (t) > 0. 



D syn (t)/0 S yn 

Thus, we define a new measure of dN/dS as the conditional 
expectation 



dN/dS(t) = Ef ^y™" I D syn > o) 



d" on (t) 
d s v n (t) 



(7) 



C(20 syn cf yn (t)), 



which is based on a series approximation of the expected 
value of a ratio of two random variables. In the Materials 
and Methods, we introduce function C (eq. 28) and provide 
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a detailed derivation of equation (7) and show that this func- 
tion can be easily computed numerically. Note, however, 
that unlike the ratio dN/dS | totah dN/dS(t) depends on the 
mutation pressure 6. In the limit 6 —> oo (which could 
be reached by an infinitely long codon sequence L — > oo), 
we recover the previous ratio of expected values, as 
dN/dS(t) = dN/dS | 

total- To visualize the difference 
between the two measures, the upper panel of figure 4 illus- 
trates the general shape of the curve dN/dS(t) up to time 
t= 100 after population split for the case y = — 1 with the 
Jukes-Cantor mutation model and four different values of 0. 
Also shown in the same graph is dN/dS | tota! , that is, the ratio 
of expectations d non (t)/cf yn (t), and the limit co Y » 0.313 as 
t — > oo. The distinct change in appearance of the dN/dS 
curves with varying values of 9 is somewhat similar to 
what the effect would be of changing the time scale from t 
to 9t. Of course this comes natural since lowering the 
overall mutation pressure in the model would cause the 
system to run on a slower time scale. It is important to 
note that the striking deviation in figure 4 of the expected 
ratio dN/dS(t) from the ratio of expectations d non (t)/d syn (t), 
is not directly an effect of the selection mechanism. On 
the contrary, the lower panel of figure 4 shows the 
corresponding set of curves for the neutral case y = 0 for 
which we have d non (t) = d syn (t). 

To provide insight into the shape of the dN/dS curves and 
their intrinsic random variations, we further estimate upper 



and lower confidence bands / + (a) and /_(«) for dN/dS de- 
fined as a ratio of two Poisson random variables, such that 

p( U a /2) < D "° n(t)/0non < l + (a/2)) « 1 -a. 

In the Materials and Methods, we obtain 



/-(«/2) 
/ + (a/2) 



P + Za/2 



P ~ Za/2 A 



p(i -pY 



p(i - p) 




(8) 



where z a is the a-quantile of the standard normal distribu- 
tion, 7t non and 7T syn represent the proportion of nonsynony- 
mous and synonymous changes, respectively, and 



7r syn c/ syn (t) 



f, = Tr^d^Ct) + 7r non d non (t). 



To cross-validate these derivations, we next explore the 
variation in dN/dS by 100 independent simulation runs 
with parameter settings N = 500, L = 2000, y = — 1, and 
0=1. We keep track of the precise numbers of nonsynon- 
ymous and synonymous differences between two popula- 
tions from population split at t-0 up to evolutionary time 
t=100, and then plot the ratio of these differences at a 
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Fig. 4. dN/dS(t) for 9 equal to 0.1 (red), 0.2 (green), 0.4 (blue), 1 (cyan) compared with the ratio dN/dS | mai (gold). (Top) dN/dS(t) for / = — 1 and 
(bottom) dN/dS(t) for y = 0. The limiting value w Y is indicated by the dashed black line, which in the lower panel is identical to unity and hidden by 
the golden line. 
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resolution of every 5N generations. The distribution and 
stochastic fluctuations in D non (t)/D syn (t) are visualized in 
figure 5. Also shown in figure 5 are dN/dS | tota |, the dN/dS- 
ratio in equation (7) as well as the shape of the confidence 
bands (eq. 8). The confidence bands represent a 5% chance of 
seeing larger fluctuations at a specific point in time and do 
not provide joint confidence for the entire function over time. 
For the present simulation of 100 runs, 14.3% of the data 
points fell outside of the region between upper and lower 
bands, where 9.7% fell above the upper band and 4.6% below 
the lower band. Moreover, note that during initial divergence 
extremely high dN/dS values that would be commonly taken 
as evidence for positive selection are frequently obtained even 
under negative selection pressure. 

Discussion 

Value and Limitations of the Model 
The standard phylogenetic model of codon evolution and the 
estimation of dN/dS was introduced in a pure phylogenetics 
context in 1994 by two independent publications (Goldman 
and Yang 1994; Muse and Gaut 1994). The observations that 
dN/dS can be influenced by mutation rate (Wyckoff et al. 
2005), branch length (Wolf et al. 2009), and polymorphisms 
motivated theoretical studies on the temporal dynamics of 
the measure. We here pick three previous studies of relevance 
and briefly discuss them in relation to our current approach. 
The first study by Rocha et al. (2006) describes the time 
dependence of dN/dS for closely related taxa, starting with 
a clonal population which over time becomes more diverse. 
As in our modeling approach their simulation study applies 
Wright-Fisher sampling in a population of fixed size where 
each generation is subject to mutation. However, instead of 
incorporating a full codon model each mutation is simply set 
to be synonymous with probability 1 /4 and nonsynonymous 
with probability 3/4. The number of synonymous mutations 
is assumed to increase linearly in time, while nonsynonymous 
mutations are sampled with a selective weight. Importantly, 
the number of accumulated nonsynonymous mutations is 
assumed to reach a limiting value over time which means 
that possible fixations are not taken into account. Although 
this may not be critical for short time periods, for long time- 
scales this leads to the inappropriate property that 
dN/dS — > Oast^- oo. Another model developed for similar 
purposes by Peterson and Masel (2009), is refined in several 
ways. As in our study, Peterson and Masel consider divergence 
between two populations after population split from a 
common ancestor and derive estimates of the expected di- 
vergence as function of divergence time. They include the 
effects of recent fixations and shared ancestral polymor- 
phisms, but neglect the effect of lineage-specific polymor- 
phisms. Their study was motivated by earlier studies on the 
effect of ancestral polymorphisms on estimates of mutation 
rate for closely related lineages, related to the apparent mu- 
tation rate acceleration (Ho and Larson 2006; Balbi and Feil 
2007). A third study closely related to ours is the study by 
Kryazhimskiy and Plotkin (2008). Their emphasis is on the 
comparison of two extreme cases, where dN/dS is estimated 



from sequences of 1) conspecific individuals and 2) distantly 
related lineages. We expand on their approach as we study 
the continuous transition between these two cases. Ideally, 
our description of dN/dS would show the same initial value as 
the one described by Kryazhimskiy and Plotkin for conspecific 
individuals. At first glance, this is not the case. Kryazhimskiy 
and Plotkin use a mutation model that allows for back and 
forth mutation between the ancestral and the derived allele 
during the time of segregation, which in the setting of codon 
evolution seems to be inappropriate and yields different 
results. However, if we no longer allow for back and forth 
mutation by letting 9 — > 0 in Kryazhimskiy and Plotkin, 
their dN/dS measure converges to our measure based on a 
single population prior to speciation. Hence our analysis of 
dN/dS with regards to the single population prior to specia- 
tion is consistent with that of Kryazhimskiy and Plotkin 
for conspecific individuals not allowing for back and forth 
mutations. The second extreme case investigated by 
Kryazhimskiy and Plotkin also reflecting the classical defini- 
tion of dN/dS as introduced in the pure phylogenetics context 
(Goldman and Yang 1994; Muse and Gaut 1994) is in full 
agreement with our description of the limiting value of 
dN/dS for t -> oo. 

A further novelty of our work in comparison with Rocha 
et al. (2006), Peterson and Masel (2009), Kryazhimskiy and 
Plotkin (2008), or any other study investigating the temporal 
dynamics of dN/dS, is that we incorporate a full codon sub- 
stitution matrix into our model, and consider selection for or 
against changes in the codon sequence. This is in contrast to 
the other works where estimates of dN/dS are based on a 
comparison of sites evolving under selective pressure versus 
neutrally evolving sites. The slightly more complicated, pop- 
ulation genetic Markov model of codon sequence evolution 
seems a natural choice as it closely mimics biological reality. 
Moreover, our model allows to capture the dynamics of dN/ 
dS at any point in time and expands its inferential value 
beyond mere phylogenetic considerations. In addition, the 
incorporation of a nucleotide substitution matrix and the 
resulting codon substitution matrix in a Markovian frame- 
work, should make it possible to specifically consider pro- 
cesses such as GC-biased gene conversion that are known 
to mimic the signature of selection (Berglund et al. 2009). 
Several other expansions of our model are conceivable. For 
its basic formulation, we restricted our model to instanta- 
neous speciation not allowing for the occurrence of gene 
flow during the onset of divergence. We expect that under 
such an isolation-with-migration scenario the bias introduced 
by polymorphisms will extend for even longer times and 
would certainly be worth exploring. Besides, other less strin- 
gent model assumptions such as site-specific variation in 
selection strength or selection on synonymous changes via 
codon usage bias might be relevant to consider. 

Implications for Empirical Evolutionary Genetics 
Studies 

The dN/dS measure is commonly used to 1) disclose evolu- 
tionary processes across species (Wright and Andolfatto 2008; 
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Ellegren 2009) or 2) to identify genes under positive selection 
for an evolutionary lineage of interest (Clark et al. 2003; 
Bustamante et al. 2005). We here demonstrated that dN/dS 
is biased for the comparison of evolutionary young lineages 
when using the standard (phylogenetic) model. Is this time 
dependence of dN/dS at all relevant for the paramater space 
empirical work is generally dealing with? Let us first consider 
the former case, where genome-wide mean dN/dS is used as a 
proxy for average selection pressure in specific lineages that 
are then related to life history traits of remnant species 
(Nikolaev et al. 2007; Wright and Andolfatto 2008). 
According to our model, we expect a clear upward bias of 
dN/dS estimates for short branches, as has been indicated by 
empirical evidence (Wolf et al. 2009). As branch length and 
life history traits such as body size or generation time are 
known to covary (Gillooly et al. 2005; Bromham 2009), this 
artifact may lead to erroneous conclusions. But how closely 
do lineages need to be related for this to be of concern? Let us 
consider the case of human-chimp divergence as an example. 
For a realistic value of y = — 1 and considering a large 
enough sequence length L that dN/dS can be approximated 
by the ratio of expected values, our results suggest that dN/dS 
is on average upward biased by approximately 46% 2 N e gen- 
erations after speciation, and still by approximately 14% 10 N e 
generations after the split. Assuming 5 million years for the 
split time between human and chimp from a common 
ancestor, an overall generation time of 20 years for the 
human lineage and a minimum effective population size of 
14,000, we obtain an estimated time to the most common 
recent ancestor of approximately 18 N e generations. At first 
sight, this suggests only a mild contribution of polymorphisms 
to dN/dS of the human lineage. Eighteen N e generations, 
however, are an overestimate for two reasons. First, ancestral 
population sizes have been larger than current human effec- 
tive population size. Assuming an average effective popula- 
tion size of 45,000 (Priifer et al. 2012) split time would be 5 N e 
rather than 18 N e generations, which falls squarely within the 
critical range of an upward biased dN/dS. Second, our model 
does not allow for migration after speciation, which will 
extend the influence of polymorphisms over longer time 
frames. These considerations are qualitatively consistent 
with evidence from Priifer et al. (2012) suggesting that 
approximately 3% of genetic variation in the human 
genome are cases of incomplete lineage sorting with respect 
to bonobo or chimp. We thus conclude that for human- 
chimp and lineages with similar or even shorter divergence 
histories, polymorphisms are an issue and need to be consid- 
ered for correct inference of selection pressure. With some 
knowledge on divergence time and effective population sizes, 
our model can in principle be used to rescale dN/dS accord- 
ingly and correct for the bias. 

The second, more prominent application of dN/dS is 
the quest for genes under positive selection in specific line- 
ages. Naturally, much effort has been devoted to isolate 
the genes (or gene classes) under adaptive selection in the 
human lineage (Clark et al. 2003; Bustamante et al. 2005). 
Within the context of our model, we can only discuss 
potential implications for approaches inferring selection for 



genes, and do not consider possible time dependencies of 
models inferring selection for single codon sites. Positive 
selection on genes or functional subsets of genes is generally 
inferred by comparing the likelihood of dN/dS being larger 
than in a neutral or nearly neutral scenario (Nielsen and Yang 
1998) making use of software applications such as PAML that 
are based on the continuous Markov process with instanta- 
neous fixation described earlier. These likelihood-based 
approaches used for inference on selection do not incorpo- 
rate the contribution of ancestral or lineage-specific polymor- 
phism and we may expect increased false positive detection 
for evolutionary young lineages, and, in particular, for genes 
where polymorphic sites substantially contribute to diver- 
gence. Judging from our results, we may predict which 
genes will be most severely affected. Looking at the per- 
gene level, we cannot any longer assume sequence length L 
to be large enough that dN/dS can be approximated by the 
ratio of expected values. Instead, we have to look at the 
statistical properties of the ratio of two Poisson random var- 
iables. Here, our results suggest that estimates of dN/dS will in 
particular be biased by polymorphisms if 1) the mutation 
pressure is low or 2) sequence length is short. Moreover, 
not only the expected value of dN/dS tends to be biased 
for such genes but also the random error or the intrinsic 
fluctuations in the estimate are particularly strong for the 
same set of genes, as indicated by wide confidence bands at 
shorter time scales. As a consequence the interpretation of 
dN/dS needs caution, and likelihood ratio tests are necessary 
to account for the random error. However, likelihood ratio 
tests can only account for the random error, but not for the 
systematic bias in the expected value caused by polymor- 
phisms. This bias is expected to be strongest in genes with 
low divergence, which can either be due to recent divergence 
time, low mutation pressure, or short sequence length. 
Hence, the systematic bias caused by polymorphisms may 
at least partly explain the common observation that genes 
with low divergence are preferably found to be under positive 
selection, as has been indicated previously (Wolf et al. 2009). 

Future Perspectives 

We have here introduced an analytical model to illustrate 
the time dependence of dN/dS and aspects of the effects of 
estimating dN/dS as a ratio of two Poisson random variables. 
Our approach expands existing models on codon evolution 
and integrates the contribution of polymorphisms to amino 
acid sequence divergence. Although not explicitly formulated 
for this purpose, we hope that our model may provide 
the basis for a refinement of the underlying theory of the 
widely used McDonald-Kreitman test and might improve 
the inference on the mode and strength of selection for 
closely related lineages by jointly using polymorphism and 
divergence data. The 1000 human genome project (1000 
Genomes Project Consortium 2012) and emerging popula- 
tion genomic studies in genetic nonmodel organisms 
(Ellegren et al. 2012) demonstrate that the necessary popula- 
tion genomic data sets will soon be readily available for a 
growing number of species. 
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Materials and Methods 

A Stochastic Model of Codon Evolution 
In this section, we introduce a detailed population genetics 
Markov model of codon sequence evolution and use it for 
two main purposes. First, we find natural equilibrium rates of 
synonymous and nonsynonymous mutations and show how 
to obtain them from standard assumptions of nucleotide 
mutation. These rates provide a reference for the volume 
fractions of the two types of mutations among the polymor- 
phic sites and represent in the model an estimate of the 
number of synonymous and nonsynonymous sites found in 
data. Our second main purpose of introducing the model is to 
keep a sufficiently detailed record of all polymorphic sites over 
time to later help analyzing the rate of divergence between 
two populations over time. It is crucial to distinguish the 
contributions to sequence divergence attributed to ancestral, 
polymorphic, and fixed differences. This is what will enable us 
to count synonymous and nonsynonymous divergence 
taking into account all three of these mechanisms and in 
the end to estimate dN/dS. 

A single population consists of N individuals. Each individ- 
ual is represented by a sequence of nucleotide sites of length 
3L structured as L consecutive codon nucleotide triplets. 
Random mutation based on standard assumptions acts on 
each nucleotide in a triple and the genetic code allows us to 
distinguish synonymous and nonsynonymous mutations. The 
fate of a mutant allele is extinction or fixation determined 
by Wright-Fisher reproduction acting independently on the 
L sites with the 64 codon states at each site. Although new 
alleles which originate from synonymous codon transitions 
evolve under neutral conditions of population reproduction, 
the evolution of mutant codon alleles that are nonsynony- 
mous with respect to the ancestral codon are affected by 
selective sampling. The chance to see two or more mutations 
at the same site overlap in time will be so small that for our 
purposes is justified to study the approximative biallelic 
model. 

In the following, we will provide a detailed account of the 
assumptions for the codon mutation model and for repro- 
duction with selection weights. This level of detail is necessary 
to introduce the appropriate notation and prepare for the 
analytical description of dN/dS through time. 

A Markov Model for Codon Mutations 
We begin by fixing the numbering of nucleotides A = 1, 
C = 2, C = 3, T = 4 and an ordered list S' — . . . , 
u M } = {111,112, 113, 114, 121,122, ...,443,444}, which 
gives an enumeration of the 64 codon types. Here, 
S 0 = {411, 413, 431} is the subset of stop codons. We write 
S for the remaining elements, the sense codons, so that 
S' = S U S 0 . By applying the biological code, we associate to 
each sense codon u e S one of the 20 existing amino acids. 
The change of a nucleotide affects the first, second, or third 
position of the corresponding codon and causes a transition 
from the original codon to one of eight or nine possible target 
codons. If codon u changes to codon v in this manner the 
mutation is said to be synonymous if u and v are coding for 



the same amino acid and nonsynonymous if the amino acids 
are different. To record this information, we introduce for 
each pair of sense codons u,v e S, u ^ v, the indicator 
variables 

,, . f 0 if w and v are synonymous, 
}{u,v) = \ 

[1 if u and v are nonsynonymous. 

Mutations involving stop codons will happen with positive 
probability but will be regarded immediately extinct. 

We assume that mutation occurs uniformly and indepen- 
dently over nucleotide sites with mutation rate /x > 0 per 
site and per generation. Writing 0 for the total mutation rate 
per generation, we have 6 = 3LN/X. A codon site is said to be 
clonal when all individuals share the same nucleotide triplet 
and is said to be polymorphic if not. For the type of model 
studied here, typically the number of polymorphic sites will be 
small in comparison with the length L and hence the number 
of polymorphic sites with more than two alleles will be even 
smaller. Applying the criteria that N is not too large in relation 
to L, see supplementary text equation (14) (Supplementary 
Material online), mutation is assumed to be suppressed 
in already polymorphic sites. Hence, all polymorphic 
sites are biallelic in the sense that one ancestral and one 
derived codon coexist with frequencies summing to one. 
Mutation is reactivated at extinction or fixation of the derived 
codon. 

To find the rates of synonymous and nonsynonymous 
mutation events, we introduce a Markov chain of codon 
mutations. At the level of nucleotides, given that a mutation 
occurs at a site in one sequence of the population the nucle- 
otide changes from / to j, i, j e {1,2,3,4} = {A, C,G,T}, 
according to a transition probability matrix H = (hjj) with 
zero diagonal elements, strictly positive nondiagonal elements 
and row sums equal to one. With probability one-third 
the affected site is the first, second, or third position of a 
codon. Thus, taking into account only one-site mutations, 
the nucleotide transitions in H generate a corresponding 
Markov chain of codon mutations on the state space S' 
given by a 64 x 64 transition probability matrix M'. Then 
to account for stop codons, we replace M' = (m' uv ) with 
the modified 64 x 64 mutation probability matrix 
M = (m uv ) obtained by retaining all jumps to the states S 0 . 
More precisely, if m' uv > 0 for some v e S 0 , we put m uv = 0 
and m uu = £ veSo m' M . 

Now, we are in position to mark each mutation event 
synonymous, nonsynonymous, or stopped by decomposing 
the mutation matrix M as 

M = M syn + M non + M stop , 

where M syn collects all nondiagonal elements m ul/ for which 
the pair u, v is synonymous (J(u,v) = 0), the elements of 
M non represent nonsynonymous changes Q(u,v) = 1) and 
M stop stores the diagonal elements m uv , u = v, of M. Let 1 
be a 64-column vector of only ones and let a and b denote the 
64-column vectors 

c? = M syn 1, b = M non 1. (9) 
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In these vectors, the kth elements a k and b k are the 
conditional probabilities to obtain synonymous and nonsyn- 
onymous derived codons, given that a mutation occurs 
in codon u k . 

If we focus on a single codon site at a given generation, 
the chance to see a mutation is proportional to 3/^N (which 
we may assume is much less than one). Hence, mutation 
events occur over time according to the transition probability 
matrix M M = (1 - 3/zN)/ + 3/xNM. Adding up the I sites of 
a sequence, it follows by independence of the mutation 
mechanism that the number of mutation events in a given 
generation is approximately Poisson distributed with mean 8. 
But only a small fraction of these events result in actual 
nucleotide substitutions, as we will see next by adding repro- 
duction and selection to the model. 

Discrete Time Wright-Fisher Model with Selection 
For the reproductive dynamics of the model, we make the 
simplifying assumption that there is free recombination, 
that is, no linkage, between sites of a sequence. Each new 
generation is obtained from the previous generation by 
Wright-Fisher sampling acting on codons such that all 
codon sites develop independently of each other. Hence, a 
clonal site remains clonal until a newly mutated codon allele 
enters in one individual of the population. At this instance, 
the site becomes polymorphic and remains so over a period 
of time during which the frequency of the derived codon 
evolves according to a Wright-Fisher Markov chain until 
absorption. If the underlying mutation event is synonymous, 
then reproduction is neutral whereas if the mutation 
is nonsynonymous then the derived and ancestral co- 
dons are sampled with the selective weights 1 + s and 1, 
respectively. Typically, we consider selection to act deleteri- 
ously, prohibiting nonsynonymous changes by letting the 
selection parameter s be negative, — 1 < s < 0. This is, 
however, no restriction as the model covers positive selection, 
s > 0, as well. At the time of absorption, the site becomes 
clonal. 

To summarize the dynamics of codon evolution in the 
population, we keep track of L triplets W' n = (A' n , B' n ,X' n ) e 
S x S x [0,N], /' = 1, . . . ,L In eacn generation n, A' n 6 S is 
the type of the ancestral codon at site /', B' n is the type of 
the derived codon if / is polymorphic and equal to A' n if / is 
clonal, and X' n is the number of individuals with the derived 
codon allele at site /'. By construction, the components 
(W' n ) n>0 , i = 1, . . . ,L are independent and identically 
distributed discrete time Markov chains with state space 
S x S x {0, . . . ,N}. The one-step transitions of the single 
codon site chain are as follows. For mutation, jumps 
(u,u,0) — > (u,i/,1) are governed by the transition matrix 
and occur with probability 3fiNm uv , v ^ u. For repro- 
duction, we let Y denote a random variable with the binomial 
distribution Bin (N,p), where p = p(u,v,x) is the sampling 
probability 

_ x(1 + s)(u,v)) 
P ~ N + sx)(u,v) ' 



Then, the jumps and corresponding transition probabilities of 
Wright-Fisher sampling are given by 

(u,v,y) with prob. P(Y = y), 1 < y < N - 1 
{u,v,x) -> I (u,u,0) P(Y = 0) 

(v,v,0) P(Y = N). 

Continuous Time Approximation of the Full Model 
Next, we consider large population size N and apply to each 
discrete time single site Markov chain {W' n ), 1 < / < L, the 
standard scheme of approximation (A[ Ntl ,B[ Ntl ,/\T 1 X[ Ntl ) 



under the change of time and scale given by 

n^Nt, s^y/N. (10) 

For the third component, it is well known that the derived 
allele frequency in the Wright-Fisher model with selection 
and no mutation converges as N — > oo to a diffusion process 
with absorbing boundaries in 0 and 1 given by the solution of 
the stochastic differential equation 

di;t = Yi;t0 - 6) dt + -&)dB t , 
where (B t ) is Brownian motion. Here, £ 0 is the initial fraction 
of derived alleles. In our case £ 0 = 1/N 0, which suggests 
that derived alleles would go extinct immediately. In our 
approach, however, the large population size scaling 
(eq. 10) is balanced against a total mutation intensity of 
order 9N/L per codon site. Thus, we replace the frequency 
N _1 X| Nt j by a process (X[) where each nonzero excursion 
follows a path of (f t ), with £ 0 = 1/N > 0 (the same approx- 
imation is used in Evans et al. [2007]). Then during the 
clonal periods, for which X\ = 0, the first two components 
in (A| Nt j , B| Nt j , 0) are continuous time Markov chains 

(A! t , B' t ,Q) (forced to have A' t = B' t ) in holding until the 
next jump of £>'. As the jump probability per generation is 
3/xN, it follows that the jump rate per N generations is 
3/xN 2 = 0N/L. Hence, the generator matrix of B' equals 
M Nfl — I. Each jump of B\ leaves (A' t ) unaffected but initiates 
a diffusion path (£ t ) embedded in X\. If the faith of (f t ) is 
extinction then B\ returns to its previous value stored in (A' t ). 
If instead the path (f t ) gets fixed then A t attains the current 
value of B\. More explicitly, we are approximating the discrete 
time Markov chain (W n ) with a continuous time Markov 
process (Wj) = (A t , B\, X\) with state space (S,S,[0,1]). 
The state space is a mixture of two jump coordinates and 
one continuous state coordinate and the process has the 
specific feature of holding and jumping from the boundary. 
We provide background information on diffusion processes 
with holding and jumping boundary in the supplementary 
text (Supplementary Material online). In our case, the bound- 
ary consists of all points (u,u,0), u 6 S. If the current state 
of Wj is (u,u,0) then after an exponential holding time of 
rate ON /I the process (B t ) jumps to state v with probability 
m uu . At the instance of such a jump (W t ) begins tracing 
a path of (u,v,(^ t ))> with 



d| t = y)(u,vMl ~ Ht) dt + A(1 - |t) dB t , £ 0 = 1 1K 
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until time of absorption. If the absorption event is fixation, 
the process jumps to (v,v,0), if it is extinction, the jump is 
to (u,u,0). Then again, the jump intensities apply until the 
next (f t )-excursion takes place, and so on. The summation 
measure process 

L 

now models the codon distribution in the entire sequence 
of length L across the population and over time, and provides 
a site frequency spectrum for the ensemble of clonal and 
polymorphic sites in the sequence. Each term features a 
record (A' v B' t ) of ancestral and derived codon types plus a 
process (A" t ), which alternates between dormant periods 
and active sessions. In each cycle, the dormant period has 
exponential length with intensity 9N/L and the active session 
consists of a Wright-Fisher diffusion corresponding to the 
absorption time of a Wright-Fisher diffusion with initial 
value 1 /N running until absorption in 0 or 1. 

The Codon Equilibrium Distribution 

The first component (A' t ) of each (W t ) has its jumps 
restricted to times of actual nucleotide substitution events. 
Synonymous mutations get fixed with probability 1/N and 
nonsynonymous mutations fix with probability a> Y /N, y ^ 0, 
with a>y introduced in equation (2) (the formal background is 
given in eq. 2 of the supplementary text, Supplementary 
Material online). So, if we single out only substitution 
events then the Markov chain transition probabilities 
reduce to (M syn + tt> y M non )/N. Thus, we consider a continu- 
ous time Markov chain with infinitesimal generator 

Q K = 3/zN(M syn +co y M mn - V dia «), 

(V dia % = a k + ft > K fa k 

where the total jump intensities stored in the diagonal matrix 
V diag were introduced in equation (9). Here, the stop codons 
S 0 are transient states. We conclude that the continuous 
time approximation A' t behaves as the Markov chain with 
generator Q y , except that each holding time is prolonged 
by the fixation time of the corresponding diffusion path in 
X\. To obtain a steady state codon distribution, however, 
one should restrict to the clonal population, which is given 
precisely by the Markov generator Q Hence, we define r\ 
to be the unique stationary distribution which satisfies 
77Q y = 0 and which provides a steady state for the irreduc- 
ible Markov chain restricted to S and with % = 0 for Uk e So. 
Now for large N, we have the interpretation that substitutions 
occur according to Q K and the typical codon frequencies 
observed in a clonal site in equilibrium is given by r\. 
Furthermore, in this equilibrium, we can measure the propor- 
tions of synonymous and nonsynonymous events among 
all mutations. For example, whenever a mutation event occurs 
according to M it is synonymous with probability given by the 
scalar product {r],a) = r\ M syn 1. Consequently, we introduce 
the probability distribution tc = (jt syri ,Jt nori ,7T st0 p), by 

;r syn = ^"1, 7r non = ?yM non 1, 7r stop = )? M stop 1. 

(13) 



In conclusion, the typical rates at which mutation events are 
synonymous, nonsynonymous, or inert are obtained as the 
weighted mutation rates 

Msyn ~ M^syn? A^non — M^non? Mstop — P^stop} 

and the conditional distribution of synonymous and nonsyn- 
onymous mutations given that a nonstop codon transition 
occurs is 

^syn ^non 
Psyn — •> Pnon — ■ 

7T S y n + 7T non TTsyn ^non 

As a consequence, the resulting synonymous and nonsynon- 
ymous mutation intensities for the population of sequences 
are given by 

#syn = ^syn^ = 3LN/Z syn , 9 non = 7T non 0 = 31_N/X non . 

(14) 

Mutation Rates for Standard Models 
The probability distributions n and (p syn ,Pnon)< an d hence the 
rates 0 S yn,#non can be found explicitly for standard mutation 
matrices H. The Kimura model with a + 2/3 = 1, takes into 
account a mutation ratio k = a/ ft of transitions versus trans- 
versions. For this model, the uniform distribution on nonstop 
codons given by r\k = 1 /61, Uk £ S, is stationary for M and 
hence Q r This holds not only for the neutral case y = 0 but 
also in general. Indeed, Q y is doubly stochastic for any y, 
and by equation (13), 

/12 26 46 22 3 4 \ 

7T = 1 a, a, a I, 

\61 183 61 183 61 183 / 

36 + 26a 138 - 220! 

Psyn — ~, s Pnon — „_ , " ■ 

y 174 + 4a 174 + 4a 

A special case of the Kimura model, which we have used for 
the simulation study in this work, is a = 1/2, /3 = 1/4, 
k = 7. Then 

/49 127 7 \ 49 

JT = , , , p syn = = 0.2784, 

\183 183 183/ y 176 

127 

Pnon = — = 0.7216. 
176 

Our model is flexible and allows for any other nucleotide 
substitution pattern, including asymmetric versions with 
nonzero diagonal elements h i: . In general, if p = 
(Pi Pi Pi Pa) is a steady state for H, hence representing 
the typical fractions of nucleotides in the population, the 
corresponding steady state of codons will be 

riu = c • piPjPk, u = (ijk), c _1 = 22 pipjp k , 

u 

and the vector it is again found from equation (13). 
Commonly used versions of the Goldman-Yang model 
(Goldman and Yang 1994), fall in this category. 

Poisson Random Field Approximation 
In equation (11), letting the initial times and paths of all active 
sessions form points of a Poisson random point measure leads 
to what has been called a Poisson random fields model 
(Sawyer and Hartl 1992). Although we do not pursue this 
line of argument formally, our approach is similar in spirit. 
The quantities of primary interest in this work, which measure 
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divergence of sampled sequences between populations, are 
recognized as random functionals of A t . These functional 
count specific events that occur along the site processes with 
probabilities proportional to 0/L. With L independent sites, 
this leads to an approximate Poisson number of events over 
the total length of the sequence with mean proportional to 0. 
To find the expected values of the divergence functionals, 
we apply the site frequency spectrum, which arises by super- 
posing Wright-Fisher diffusions starting in 1/N at Poisson 
times with rate ON (Evans et al. 2007). 

Recall from equation (12) that the typical codon frequen- 
cies are given by the steady state {i-] u ,u e S \ S 0 }. 
Conditional on the first component of A t in site / being 
A! t = u, then B\ typically makes a large number of jumps 
from u to a sense codon neighbor v of u and then back to u, 
before one of these excursions eventually fixes and results in a 
change of state of A! t . Given such a u, B\ settles in a steady 
state with probabilities m uv . Summing again over u with the 
stationary weight, we recover ^ u n u m uv = n v . Therefore, it is 
reasonable to assume that mutations occur along the time 
axis as an approximate Poisson process with intensity NO and 
each event sparks the excursion of a Wright-Fisher diffusion 
(f s ) with §o = 1 /N. The fractions of mutation events, which 
lead to synonymous and nonsynonymous codon pairs are 
obtained by the weighted summations 

u,i/eS\S 0 u,i/€S\S 0 

and the corresponding mutation intensities are given by 
N# syn and N0 non . A key feature here is that nonsynonymous 
and synonymous codon mutations evolve independently 
of each other. 

Rate of Divergence over Time 

In this section, we consider a population split where a 
population of size N has been running indefinitely from the 
past and is replaced at time t-0 instantaneously by two 
identical copies of the population. The new branches repre- 
sent two emerging species both of population size N with 
initially the same number of clonal and polymorphic sites 
and with identical codon frequencies. From the splitting 
time and onwards each of the two populations evolve inde- 
pendently according to the same mechanisms of mutation 
and selection. To follow the onset of divergence between 
the species, we sample randomly one individual in each 
population. Let D(t) denote the number of nucleotide 
differences between these two sequences at time t > 0. We 
will analyze three types of differences which contribute to 
the total divergence D(t), by writing 

D(t) = Dfc(t) + Dpoi(t) + D anc (t), 

where 

Dfj x (t) = sequence divergence at t from mutations during 

[0, t] fixed uring [0, t], 
Dp 0 |(t) = number of derived alleles sampled from 

lineage-specific polymorphic sites at t and 
Danc(0 = sequence divergence at t attributed to 

ancestral polymorphisms existing at t = 0. 



Here, Dj; x (t) and D po |(t) are sums of two independent 
contributions, one from each population, whereas D anc (t) 
involves the joint initial state at t-0. The dominant source 
of divergence between two sequences which is visible after a 
longer time span t, is the fixation of new alleles from recent 
mutations in each population during (0,t). The growth in the 
number of substitutions and the subsequent growth of Dfj x (t) 
is essentially linear in t. This is the same mechanism, which is 
responsible for the mixing of codons and the appearance of a 
steady state of codon frequencies in the long run. The addi- 
tional contributions to the total divergence D(t) are bounded 
as functions of t but are important to understand how the 
linear growth regime is attained after population split. 

We will analyze the three types of divergence by relating 
the components of D(t) to suitable functionals of A t in equa- 
tion (11), extended to cover a common ancestry for t < 0 
and two independent species populations for t > 0. At any 
given time, some of the L codons in the model are likely to be 
polymorphic and hence exempt from mutation events. But as 
the number of polymorphic sites is typically much smaller 
than L, we will apply the approximation that the total muta- 
tion intensity is 0 = 3fj.NL per sequence and generation, 
hence NO per sequence and time units t. It is the independent 
Poisson mutation processes in our model that drive the 
various contributions to D(t). In particular, nucleotide substi- 
tutions count into Dfl x (t), which therefore has a Poisson 
distribution. Sampled differences, both ancestral and present 
polymorphic, arise at most one in each codon site. By equa- 
tions (15) and (16) of the supplementary text (Supplementary 
Material online), the probability to see one of these differ- 
ences at a given site after sampling sequences in two popu- 
lations is proportional to the mutation intensity per codon 
and time unit, namely 0/L. Summing over L codons this gives 
a binomial, hence approximately Poisson, number of differ- 
ences with mean proportional to 0. But then D po |(t) and 
Danc(0' an d therefore D(t) itself have approximate Poisson 
distributions. Our next focus will be to find the corresponding 
expected values. 

Expected Divergence after Population Split 
We begin with divergence based on fixed differences. Our 
model associates with the Wright-Fisher diffusion (§ t ) its 
fixation time x y , extinction time t 0 and absorption time 
t = min(r 0 ,r 1 ). The total number of mutation events in 
[0,t) is Poisson with mean NOt and conditional on this 
number the events are uniformly distributed on [0,t). One 
such mutation occurring at time s results in a fixation if 
s + T| < t. Summing over both populations and taking into 
account the fractions of synonymous and nonsynonymous 
events, this gives 

ED fix (t) = 2N0t ■ - f Py N (s + T-i < t) ds 

t Jo 

= 2N0 syn f P>° ( T1 <s)ds 
Jo 

+ 2N0 non [ P yN (Ti <s)ds 
Jo 
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Rewrite as P^ /N fa < t) = P*} N (ti < t)q y (1/N), where 
P^t-i < t) is the conditional probability given fixation 
(ti < oo) and q Y (x) = P y (Ti < t 0 ) denotes the fixation 
probability of a mutated allele which emerges with frequency 
x in the population. By equation (2), q Y (1/N)~a> Y /N. 
Hence, in the large N limit, 



ED fix (t) = 2fl syn C(0 + 20non«vC\O 



with 



O) 



Jo 



s)ds, d™"(t) 



f PS"(T1 
Jo 



< s) ds. 



Equivalently, 

C(0 = t- 



E^minfr ,t)), d™ n (0 = t - E^minfo ,t)), 

which reveals the deviation from linear growth of ED fix (t). 
For y = 0, 

E*°(min(T 1 ,t)) = C 0 (t), 

with the function C 0 denned in equation (12) of the supple- 
mentary text (Supplementary Material online). One option 
for the general selective case y^O would be to apply a 
spectral representation for the transition density of the 
Wright-Fisher model with selection, and rely on numerical 
computations of the corresponding eigenvalue/eigenvector 
problem. To keep things simpler while retaining a reasonable 
degree of accuracy, instead we propose at this place the 
approximation 

E^minC^t)) « G y (t), G Y (t) = min(Co(t),E* y (r 1 )). 

By applying a known integral expression for E* y (r-|), see 
Karlin and Taylor (1981), Ch. 15, (9.9), and expanding the 
resulting integral in y, we obtain 



Eo 7 (*i)= / 
Jo 

1 , 7 



y(1 -y)y(1 -e- 2y ) 



dy 



(15) 



and hence 



In summary, 



= 2--y 2 + — y 4 + 0(y 6 ) 
9 675 



C Y (t) « min(C 0 (t),2 - y 2 /9). 



ED fix (t) « 20 syn (t - G 0 (0) 

+ 26 nm (D Y (t - min(C 0 (t),2 - y 2 /9)). 



(16) 



Next, we turn to divergence based on lineage-specific poly- 
morphisms. As discussed earlier, consider a mutation at a 
uniformly distributed time s in [0,t). The corresponding 
derived allele exists at time t if s + x > t, in which case 
0 < |* < 1. In each population, we have sampled one par- 
ticular individual. The probability that this individual carries 
the derived allele is approximately |*. Hence, the probability 
that this lineage-specific polymorphism contributes to 



estimates of divergence is 

,s + t > t) ds 



pmm(t,x) 



1 

t EVN 



r > r) dr 

min(t,r) 



Now summing over all mutation events in [0,t] in both 
populations and splitting neutral synonymous ones from 
selective nonsynonymous mutations, we find 

ED po ,(t) = 2V#(t) + 20 non d™(t), 

where 

_ /*m\n(t,z) _. 
^ (t) = Nl m oc NE v4/ 0 H" 



(17) 



lim NE y 



/.min(t,T) 

1 H- 



To continue estimating cK(t) and d™"(t), it is a useful fact 
that the functional E y [/ Q r | s ds] has an explicit representation 
as an integral over the corresponding Green's function (for 
details see supplementary text, Supplementary Material 
online). Indeed, for y = 0, we have NE° /N [/ Q T | s ds] 2 
and for 

j | s dsJ=<w y j — dy, (18) 



lim NE y 



as N tends to infinity. To accommodate the behavior for 
small t, we apply the approximation 



E; 



/•min(t,t) / pt pr 

J | s dsj « min f Jf E y [| s ] ds,E y [ jf | s dsj 



(19) 



Here E°[| s ] = x for y = 0, so for large N 

/■min(t,T) 



min IN 




fe]ds,NE y /N 




min(t,2). 



More generally, by conditioning 

TO = E x K fe I *i < lb] q y (x) + E y [| t I To < T,] (1 - q K (x)). 

The conditional expectations on the right hand side appear 
to be well approximated by those for the neutral case y = 0, 
which can be derived explicitly, 

E y [| t | n < T 0 ] « E°[| t | n < T 0 ] = 1 - (1 - x)e- ( , 
E y fe I To < Ti] « E°[| t | T 0 < n] = xe-'. 
Thus, 



TO«4yW0 -e- f )+xe- 



(20) 



and 



N 



Jo 



[| s ] ds % <w y t + (1 - aj y )(l - e '). 
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and if we plug this and equation (18) into equation (19) it 
follows 



d£(t)«NE? /N [j[ f s ds] 



min &> y t+(1— &> y )(1— e ) 



Thus, if we apply the integral approximation 



,COy / - 

Jo 



-2yy 



yy 



-dy 



L 



■\- e - 2 yy e~ 2v - 1 +2y + 2y 2 

yy 2y 2 



which should be sufficiently accurate at least in the range 
| y | < 2, and put 



H y (t) = min 1 1 + 



1 - ftj v 



O-e" 1 ) 



e" 2)/ - 1 + 2y + 2y 



2y 2 



(21) 

then d[j° n (t) » 0J K H K (t) and we may sum up the contribu- 
tion to divergence based on polymorphic sites as 

ExDpoi(t) « 2<9 syn min(t,2) + ie non co y H Y (t). (22) 

The ancestral contribution to divergence between the two 
populations has its origin in the common history of mutation 
events that has occurred at times s < 0. Of all polymorphic 
sites which exist at the time of population split t-0, those 
for which the sampled individuals at a later time are different 
in the two populations add to ancestral divergence. This 
includes polymorphic and fixed or extinct states as long as 
one is ancestral and the other derived. A derived allele starting 
at s < 0 exists with frequency §J > 0 at time t-0, if 
s + t > 0. Hence, conditionally given t- s 0 , the chance to see 
such a difference is 



f-Oh 



E^o-fPVro-^)] 

where each population is indicated with an additional upper 
index. By independence after time t-0 this is 

2<(^)(1 - <(x) = Efl&I, m°(x) = x. 

To average over all states at t = 0, we note that the number 
of mutation events over a finite interval (— K,0) is Poisson 
with mean NOK. Hence, conditional on the number of events 
the mutation times are uniformly distributed in (— K,0). 
The probability that one of these events contributes to 
ancestral divergence is 

~ f E x [i {s+T>0} 2<(r 0 )(i -<(r 0 ))]ds 

K J-K 

= E x [V <T) 2m>XO0 " <(#))] Ar - 

Letting K —> oo, and separating synonymous and nonsynon- 
ymous events, 



ED anc (t) 



2<9 syn NE' 



°y N f f,0-fe)ds 
Jo 

+ ie nm NE y yN f mffe)(1 - m^fe)) ds. 
./o 



(23) 



Again, we rely on being able to evaluate functionals of the 
Wright-Fisher process of the type E° / Q r g(£ s ) ds for suffi- 
ciently nice functions g. First (eq. 5 of the supplementary 
text, Supplementary Material online), 



Jo 



fe(1-fc)d$-> 1. 



For the second term in equation (23), we use again the 
approximation (20) of m t y (x). Then 



-mr(y)0-<(y))dy, 



E° /N f mffe)(1 - mrfe))ds -> a) y ^(t) s 
Jo 

with 

mf(y)(1 - mf (y)) 

« (1 - e- f ) 2 q y (y)(1 - q y (y)) + e- 2t y(1 - y) 

+ e- ( (1 - e- f )q y (y)(1 -y) + e- f (1 - e^O - q y (y))y. 

By expanding the resulting integrals in a series up to second 
order in y, 



Jy(i) « 1 - 3 (1 + e- f ) - 0 - 10e-' + 3e- 2t ). (24) 

Hence, we conclude from equation (23) that the expected 
divergence attributed to ancestral polymorphisms in this 
model is 



ED anc (t) = 26 syn + 26 non a)y) y (t). 
At t = 0, as m Y 0 (x) = x, 



Mo) 



Jo 



-2Ki-y) 



-dy 



oj y - 1 
y<o y 



(25) 



(26) 



so that 



ED anc (0) = 2# syn + 26 

non " 



and we have the initial ratio 



ED£(O)/0, 



ED^(0)/ft 



syn 



1 + - y — y 

V 45 r 



Now, we are in position to study synonymous and nonsyn- 
onymous divergence separately. By summing up the three 
parts of divergence in equations (16), (22), and (25), that is, 
fixed differences and divergence attributed to lineage-specific 
and ancestral polymorphisms, we find for any fixed t that 
D syn (t) and D non (t) have approximate Poisson distributions 
with expected values 

ED syn (t) = 20 syn cf yn (t), ED non (t) = 2e non d non (t), (27) 

such that 

Gf syn (t) = t - G 0 (t) + min(t, 2) + 1 

and 

d mn (t) = o) y (t - min(G 0 (t),2 - y 2 /9) + H y (t) +J y (t)) 
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with C 0 defined in the supplementary text (Supplementary 
Material online), H y in equation (21) and ) y in equation (24). 

Statistical Properties of Divergence and Poisson Ratios 
The codon evolution model introduced above allows us to 
study in detail the accumulation of fixed differences and the 
variation of sampled differences between sequences from two 
independent populations after the split from an ancestral 
population. We continue in this direction by studying the 
ratio of nonsynonymous to synonymous divergence dN/dS 
in a population genetics framework as a ratio of Poisson 
random variables. 

Intrinsic Variation of Poisson Ratios 
Considering the ratio of scaled Poisson variables 



P n ° n (t)/flnon 



on D syn (t) > 0, 



enables us to study dN/dS as a function of divergence time 
by defining 

dN/dS(t) = I D sy "(t) >o) 9 ^. 

To compute this conditional expectation, we note that if Y is 
a Poisson random variable with mean m then 

E(1/Y | Y > 0) = Y\P(Y= k\ Y > 0) 



fc=1 



1 nre 



^'0 

Hence, if we introduce the function 



C(m) = mE(1/Y|Y>0) = g ) , kM _ e _ m , m > 0, 

(28) 



it follows 

dN/ds(t) = e^D m \t)0 sy ^[D sv "(ty 

d non (t) 



D syn > 0] 



c/ s v n (t) 



C(20 syn d syn (O)- 



(29) 



Thus, we have two alternative methods to estimate dN/dS 
based on computing suitable expected values. The previous 
ratio of expected values d non (t)/d syn (t) and the new expected 
value of a ratio. Our main motivation for introducing equa- 
tion (29) is that current practice in empirical work on dN/dS 
does not per se seek to estimate the expected number of 
synonymous divergence or expected number of nonsynony- 
mous divergence but rather aim at determining single obser- 
vations of these quantities. Then forming the ratio of these 
observations leads to equation (29). It is noteworthy that 
the new estimate is simply a multiplicative perturbation of 
the previous ratio and that the prefactor C(20 syn d syn (t)) 
depends on both 0 and t. However, the dependence is 
restricted to variations in the product 20 syn d syn (t). If t > 2, 



this is 27T syn 0(1 +t) and if we take in addition the Jukes- 
Cantor mutation model, then the prefactor is approximately 
C(6>(1 + 1)/2). For example, if 0(1 +0^8 then dN/dS(t) is 
more than 30% larger than dN/dS(t) tota , and if 0(1 + 1) « 20 
then dN/dS(t) is more than 10% larger than the reference 
ratio. This said, one should also notice that for sufficiently 
large 9 or t, the two estimates coincide. In fact, C(t)~1 
as t — > oo and hence dN/dS(t) « d non (t)/tf yn (t) as 
#syndsyn(0 oo. For small t, 

dN/ds(0^M°K( 2 M> t^°- 

Confidence Intervals 

Suppose we have fixed 8 and H, and determined the vector jt. 
Considering y as an unknown parameter we suppose k and 
n — k are observations of the independent Poisson random 
variables D syn (t) and D non (t) at some unknown time t. 
Then, given the sum n, D syn (t) has a binomial distribution 
Bin(n,p(t)), where 



p(i) = 



2(9 syn cf yn (t) 

2 0syn d syn( t ) + 26 non d non {t) 



1 + 



7T non d n0n {t)\ 
jj-syn >yn( t ) J 



(30) 



Now apply a confidence interval [a,b\ for p(t). It follows 
that [7r syn (1/b - 1)/jr non , 7T syn (1/ci - 1)/7T non ] is a confi- 
dence interval for d non (t)/d syn (t). If we apply specifically a 
95% interval of the so-called Wilson type for p(t), based 
on normal approximation of the binomial distribution, then 
with z = 1 .96 the limits a and b are given by 



,n 2n V n 3 




The lower und upper limits of the observed confidence 
interval for d non (t)/d syn (t) become 



7T syn 
7T non 



/ 



1 + 



V" 



±Z 



k(n-k) 



+ 



\ 



J 



For example, if the upper limit would attain a value less than 
one we have evidence with a 5% degree of significance to 
reject a null hypothesis of neutral evolution in favor of 
negative selection. 

Now we change the perspective and consider y and hence 
d non (t) as known. We want to estimate the variation in 
D non (t)/D syn (t). Given that the total number of observed 
differences at time t is n, then with p = p(t) 



r p _ p0-p) < ont) <p+ pO-pV 
i V n n V n , 



P p + z 



p(i -py 



-1 < 



D non (t) 
D s v n (t) 



p(i -py 



-1 = 0.95. 



229 



Mugal et al. • doi:10.1093/molbev/mst192 



MBE 



Hence, we may take p as in (30) and estimate n with 
n = 26{Tt^ n d^{t) + Tt nm d non (t)) to get a 95% "confidence 

nnon/>\ I anon 

band" for ° WL as 



p ±z 



p(1 - p) 




(31) 



Simulation of Codon Evolution 

The accuracy of our analytical results of course depends 
on the degree to which the various approximations that 
we applied during their derivation have distorted the prop- 
erties of the original model. In particular, our results were 
derived as large population size approximations based not 
only on the rescaling of time and mutation and selection 
parameters but also on Poisson approximations that ignored 
some of the subtler dependency structures in the model. 
Hence, one must ask if the dN/dS ratios derived here cor- 
rectly capture the relation of nonsynonymous to synony- 
mous changes over discrete generations as it evolves in the 
initial modeling setup. Furthermore, one would like to know 
whether the confidence bands in equation (31) derived with 
similar approximation methods reflect the true statistical 
variation in the original model, or not. 

For the purpose of validating our analytical results, and 
hence providing evidence that our dN/dS ratios with rea- 
sonable accuracy indeed capture both the average behavior 
and the variation of nonsynonymous to synonymous diver- 
gence, we carried out a simulation study of the discrete 
time Wright-Fisher model with selection as introduced 
previously in the Materials and Methods. The code was 
written in Matlab (R2012b) and simulates the Markov 
chain (W^ n , . . . ,w£) n>0 with the following choice of 
parameters: N = 500, L = 2000, fj, = (1 /3) x 10~ 6 , s = 
—2 x 10~ 3 , and the mutation matrix H chosen to be 
the Kimura matrix with a = 1/2, /3 = 1/4. As a conse- 
quence, we have the scaled parameters 8 = 1 and 
y= —1. The initial codon distribution was chosen to be 
clonal according to arbitrary codon usage. In other words, 
each independent component W' n in the codon sequence 
is given the initial distribution W' Q = (u,u,0) with u arbi- 
trarily selected. Then, a single population was generated 
during a burn-in period of 10,000 generations (20 time 
units) to move toward equilibrium codon usage. The par- 
ticular configuration of codons and its polymorphic states 
constitutes the distribution of shared ancestral polymor- 
phisms. Then, two populations evolve in parallel but oth- 
erwise independent over 50,000 generations (100 time 
units). During the generation of these data, the code 
keeps track of the accumulated number of fixations as 
well as the resulting ancestral and polymorphic divergence 
if one sequence had been sampled from each population at 
that particular time. With the additional knowledge of the 
type of each fixed or sampled difference — nonsynonymous 
or synonymous — one obtains the simulated dN/dS ratio as 
a function over discrete time. 



Supplementary Material 

Supplementary text is available at Molecular Biology and 
Evolution online (http://www.mbe.oxfordjournals.org/). 
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